Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations

Supplementary Audio Material. Submitted to IUI 2024



Table 1: Natural recordings from the selected AA voice. Samples marked with (*) were used as part of Study 3.

Samples (non-synthetic)
1
2 (*)
3 (*)
4 (*)
5


Table 2: Comparison of single- and multi-speaker models. The multi-speaker model was used to generate the final samples for Studies 1 and 2 (see Table 3).

# Single-Speaker Model Multi-Speaker Model Text
1 OK, call us back if you run into any other issues, and enjoy the rest of your afternoon! Bye!
2 Yes, I'm back! Thank you so much for holding! I really appreciate your patience!
3 I'm not really sure what's causing this delay. Looks like the item is in stock? Let me take a closer look.
4 Yes, would you mind holding the line just a bit longer? Sorry to do this to you again, but I'm having some issues retrieving your account.
5 Um, sorry, but I believe there are no direct flights out of your preferred airport.


Table 3: Synthetic samples used in Studies 1 and 2 for the AA and WH voices.
To prevent listeners from being exposed to the same text more than once, a different set of sentences was used for each voice.

Sample AA WH
01
02
03
04
05
06
07
08
09
10